zip = read.csv("/course/data/zip/zip.csv")
subZip <- subset(zip, digit == 4 | digit == 9)
# head(subZip)
dim(subZip)
## [1] 1673 257
Dimension: The data frame has dimensions 1673 x 257, meaning there are 1673 observations and 257 variables.
Numbers and pixels: As the output of head() shows, each row holds the image data for one handwritten digit: the “digit” column gives the label, and the remaining columns give the pixel values of the image. These values all lie between -1 and 1, so they have presumably been normalised or standardised in some way.
Pixel interpretation: Each observation represents a 16x16 pixel image, so each of the 256 “p…” columns gives the brightness (grey-level intensity) of one pixel in that image.
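One way to check this interpretation is to render a single observation as an image. A minimal sketch, assuming the pixel columns are named p1 through p256 and stored in row-major order:

```r
# Render the first observation of subZip as a 16x16 greyscale image
# (assumes pixel columns p1..p256 are in row-major order).
px <- as.numeric(subZip[1, paste0("p", 1:256)])
m  <- matrix(px, nrow = 16, ncol = 16, byrow = TRUE)
# image() draws rows from the bottom up, so flip the matrix first
image(t(m[16:1, ]), col = grey(seq(0, 1, length.out = 256)),
      axes = FALSE, main = paste("digit =", subZip$digit[1]))
```

If the plotted image looks like the labelled digit, the row-major assumption is correct; a scrambled image would suggest a different column ordering.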
The data in question consist of pixel values from images, where each pixel can take on up to 256 distinct grey levels (here rescaled to lie between -1 and 1). This kind of data presents several practical challenges:
Curse of Dimensionality: Images can contain thousands to millions of pixels, so using pixel values directly as input features introduces a very high dimensionality. In such high-dimensional spaces many algorithms degrade, because pairwise distances concentrate and data points tend to become nearly equidistant, which hurts distance-based methods such as K-nearest neighbours.
Computational Costs: Handling high-dimensional data demands more memory and compute; training complex models, especially deep learning ones, directly on raw pixels can be time-consuming and resource-intensive.
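The distance-concentration effect is easy to demonstrate with simulated data. A self-contained sketch (the dimensions 2, 16 and 256 are chosen only for illustration, the last matching the number of pixel columns here):

```r
# Distance concentration: as the dimension grows, the ratio of the
# farthest to the nearest pairwise distance shrinks toward 1, so
# "near" and "far" neighbours become hard to tell apart.
set.seed(769)
dist_ratio <- function(p, n = 100) {
  d <- dist(matrix(runif(n * p), n, p))  # n uniform points in p dimensions
  max(d) / min(d)
}
sapply(c(2, 16, 256), dist_ratio)
# The ratio drops sharply between p = 2 and p = 256.
```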
What is the OOB error using all predictors? What is the OOB error if only the two selected predictors are used in the Random Forest model?
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
subZip$digit <- as.factor(ifelse(subZip$digit == 4, "digit_4", "digit_9"))
select_data <- subZip[, c("digit","p9", "p24")]
set.seed(769)
# RF model using all 256 pixel predictors
(r <- randomForest(digit ~ ., data = subZip, importance = TRUE))
##
## Call:
## randomForest(formula = digit ~ ., data = subZip, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 16
##
## OOB estimate of error rate: 1.08%
## Confusion matrix:
## digit_4 digit_9 class.error
## digit_4 845 7 0.008215962
## digit_9 11 810 0.013398295
# OOB error using all predictors
(all_oob <- r$err.rate[nrow(r$err.rate), "OOB"])
## OOB
## 0.01075912
# Find the two most important variables
imp <- importance(r)
(top_2_vars <- rownames(imp)[order(-imp[, "MeanDecreaseGini"])][1:2])
## [1] "p24" "p9"
# RF model using only the two most important predictors
(r2 <- randomForest(digit ~ ., data=subZip[, c(top_2_vars, "digit")]))
##
## Call:
## randomForest(formula = digit ~ ., data = subZip[, c(top_2_vars, "digit")])
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 8.19%
## Confusion matrix:
## digit_4 digit_9 class.error
## digit_4 773 79 0.09272300
## digit_9 58 763 0.07064555
# OOB error using only the two selected predictors
(top_2_oob <- r2$err.rate[nrow(r2$err.rate), "OOB"])
## OOB
## 0.08188882
The two most important predictors for classifying between observations with digit=4 and digit=9 are p24 and p9.
OOB using all predictors: 0.01075912
OOB using two predictors: 0.08188882
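The importance ranking used to pick p24 and p9 can also be inspected visually. A short sketch using the fitted model r from above and randomForest's built-in plotting helper:

```r
# Plot both importance measures (mean decrease in accuracy and in Gini)
# for the 20 highest-ranked pixel predictors of the full model.
varImpPlot(r, n.var = 20, main = "Variable importance (digit 4 vs 9)")
```

If p24 and p9 stand well clear of the remaining pixels in both panels, that helps explain why a two-predictor forest still achieves an OOB error of only about 8%.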
Compute the adjusted Rand indices for these clustering results, by comparing them with the true class labels. Does this unsupervised learning method do a good job for the supervised data set here, in particular when K=2?
library(mclust)
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
select_data <- subZip[, c("digit","p9", "p24")]
cluster_data <- subZip[, c("p9", "p24")]
true_labels <- subZip$digit
ari = double(6)
# Standardise the data first
cluster_data = as.data.frame(scale(cluster_data))
# K-means clustering (seed set so the K=2 fit used for plotting is reproducible)
set.seed(769)
r0 = kmeans(cluster_data, centers=2) # K = 2
for(k in 2:7) {
  set.seed(769)
  r = kmeans(cluster_data, centers = k)
  ari[k-1] = adjustedRandIndex(r$cluster, true_labels)
}
ari
## [1] 0.7335104 0.6402202 0.5905239 0.5505367 0.4777205 0.4693441
# Evaluation for K=2
if (ari[1] > 0.5) {
cat("When K=2, the unsupervised learning method does a relatively good job on the supervised dataset.\n")
} else {
cat("When K=2, the unsupervised learning method does not perform well on the supervised dataset.\n")
}
## When K=2, the unsupervised learning method does a relatively good job on the supervised dataset.
# Setting up the plotting area for 1 row and 2 columns of plots
par(mfrow=c(1,2))
# Visualizing the actual distribution of digit
plot(select_data$p9, select_data$p24, col=ifelse(select_data$digit == "digit_4", "blue", "red"),
     xlab="p9", ylab="p24", main="Actual Distribution of Digit",
     pch=20, cex=0.5)
legend("topright", legend=c("digit=4", "digit=9"), fill=c("blue", "red"))
# Visualizing the clusters when K=2
plot(cluster_data$p9, cluster_data$p24, col=ifelse(r0$cluster == 1, "green", "purple"),
xlab="p9", ylab="p24", main="K-means Clusters (K=2)",
pch=20, cex=0.5)
legend("topright", legend=c("Cluster 1", "Cluster 2"), fill=c("green", "purple"))
# Reset the plotting area to default
par(mfrow=c(1,1))
Considering the ARI value of 0.7335104 for K=2, the unsupervised learning method (K-means in this case) does a relatively good job at clustering the supervised dataset when partitioned into two clusters. It’s not perfect, but the value suggests that the clusters have a decent level of alignment with the actual digit labels of 4 and 9.
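The decline in agreement as K grows is easier to see plotted. A small sketch using the ari vector computed above:

```r
# ARI against the number of clusters; with only two true classes,
# extra clusters split the digits and lower the agreement.
plot(2:7, ari, type = "b", pch = 19,
     xlab = "Number of clusters K", ylab = "Adjusted Rand Index",
     main = "K-means ARI vs K (digits 4 and 9)")
abline(h = ari[1], lty = 2)  # mark the K = 2 benchmark
```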
d=dist(cluster_data)
# Complete linkage (the default method for hclust)
par(mfrow=c(3, 2), mar=c(4, 4, 2, 1))
r = hclust(d)
# Loop for plotting clustering results from K=2 to K=7
for(k in 2:7) {
plot(select_data$p9, select_data$p24, col=cutree(r, k) + 2,
main=paste0("K = ", k), pch=20, cex=0.5, xlab="p9", ylab="p24")
}
# Single Linkage
par(mfrow=c(3, 2), mar=c(4, 4, 2, 1))
r = hclust(d, method="single") # single linkage
for(k in 2:7) {
  plot(select_data$p9, select_data$p24, col=cutree(r, k) + 2,
       main=paste0("K = ", k), pch=20, cex=0.5, xlab="p9", ylab="p24")
}